1 Introduction

In this project, I implement several regression models (multiple linear regression, random forest, LASSO, neural network, etc.) to explore the relationship between happiness scores and other factors, using data from the United Nations Sustainable Development Solutions Network (UNSDSN). I also aim to find a suitable model for predicting the happiness score. The underlying data come from the Gallup World Poll. The dataset gives the happiness rank and happiness score of 155 countries from 2015 to 2017, based on seven factors: family, life expectancy, economy, generosity, trust in government, freedom, and the dystopia residual.

2 Loading and Exploring Data

2.1 Loading required libraries

library(data.table)
library(tidyverse)
library(corrplot)
library(plotly)
library(wildcard)
library(PerformanceAnalytics)
library(DT)
library(maps)
library(caTools)
library(ggpubr)
library(leaps)
library(boot)
library(glmnet)
library(reshape)
library(car)
library(mgcv)
library(gamclass)
library(e1071)
library(randomForest)
library(neuralnet)
library(BSDA)
library(knitr)

2.2 Reading and Cleaning data

wh15 <- fread("~/Downloads/GWU/6214_Barut/Final/world-happiness-report/2015.csv",data.table = FALSE)
wh16 <- fread("~/Downloads/GWU/6214_Barut/Final/world-happiness-report/2016.csv",data.table = FALSE)
wh17 <- fread("~/Downloads/GWU/6214_Barut/Final/world-happiness-report/2017.csv",data.table = FALSE)

wh15$year <- 2015
wh16$year <- 2016
wh17$year <- 2017

names(wh17)[2] <- "Happiness Rank"
names(wh17)[3] <- "Happiness Score"
names(wh17)[6] <- "Economy (GDP per Capita)"
names(wh17)[8] <- "Health (Life Expectancy)"
names(wh17)[11] <- "Trust (Government Corruption)"
names(wh17)[12] <- "Dystopia Residual"

wh15_17 <- bind_rows(wh15,wh16,wh17)
names(wh15_17)[5] <- "SD_error"
wh15_17 <- wh15_17 %>% select(Country:year,-Region, -SD_error)
names(wh15_17) <- c("Country","Happiness_Rank","Happiness_Score","Economy",
                 "Family","Health","Freedom","Trust",
                 "Generosity","Dystopia_Residual","year")

There are 12 variables in this dataset.

text_tbl <- data.frame(
  Variables=c("Country","Region","Happiness Rank","Happiness Score","Standard Error","Economy (GDP per Capita)","Family","Health (Life Expectancy)","Freedom","Trust (Government Corruption)","Generosity","Dystopia Residual"),
  Description=c("Name of the country.","Region the country belongs to.","Rank of the country based on the Happiness Score.","A metric measured in 2015 by asking the sampled people the question: How would you rate your happiness on a scale of 0 to 10, where 10 is the happiest.","The standard error of the happiness score.","The extent to which GDP contributes to the calculation of the Happiness Score.","The extent to which Family contributes to the calculation of the Happiness Score.","The extent to which Life expectancy contributed to the calculation of the Happiness Score.","The extent to which Freedom contributed to the calculation of the Happiness Score.","The extent to which Perception of Corruption contributes to the Happiness Score.","The extent to which Generosity contributed to the calculation of the Happiness Score.","The extent to which Dystopia Residual contributed to the calculation of the Happiness Score."))
kable(text_tbl)
Variables Description
Country Name of the country.
Region Region the country belongs to.
Happiness Rank Rank of the country based on the Happiness Score.
Happiness Score A metric measured in 2015 by asking the sampled people the question: How would you rate your happiness on a scale of 0 to 10, where 10 is the happiest.
Standard Error The standard error of the happiness score.
Economy (GDP per Capita) The extent to which GDP contributes to the calculation of the Happiness Score.
Family The extent to which Family contributes to the calculation of the Happiness Score.
Health (Life Expectancy) The extent to which Life expectancy contributed to the calculation of the Happiness Score.
Freedom The extent to which Freedom contributed to the calculation of the Happiness Score.
Trust (Government Corruption) The extent to which Perception of Corruption contributes to Happiness Score.
Generosity The extent to which Generosity contributed to the calculation of the Happiness Score.
Dystopia Residual The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.

The seven factors that influence the Happiness Score are Economy, Family, Health, Freedom, Trust, Generosity, and Dystopia Residual. The sum of these seven factors gives the happiness score, and the higher the happiness score, the better (lower) the happiness rank. A higher value of each factor therefore indicates a higher level of happiness, and each factor can be read as the extent to which it contributes to happiness. Dystopia, the opposite of utopia, is a hypothetical country with the lowest happiness level; it serves as a reference point showing how far each country is from being the least happy country. After understanding the dataset, I import, arrange, and clean the data in RStudio. The next section shows some features of this dataset.
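Because this additive structure underlies everything that follows, here is a minimal sketch with made-up factor values (not actual survey data) showing how a score decomposes:

```r
# Hypothetical factor contributions for one made-up country (not survey data)
factors <- c(Economy = 1.40, Family = 1.35, Health = 0.94, Freedom = 0.67,
             Trust = 0.42, Generosity = 0.30, Dystopia_Residual = 2.52)

# The happiness score is simply the sum of the seven contributions
happiness_score <- sum(factors)
happiness_score  # 7.6; real scores range from roughly 2.7 to 7.6
```

A lower (better) rank then simply corresponds to a larger such sum.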

2.3 Data size and structure

head(wh15_17)
names(wh15_17)
##  [1] "Country"           "Happiness_Rank"    "Happiness_Score"  
##  [4] "Economy"           "Family"            "Health"           
##  [7] "Freedom"           "Trust"             "Generosity"       
## [10] "Dystopia_Residual" "year"
str(wh15_17)
## 'data.frame':    470 obs. of  11 variables:
##  $ Country          : chr  "Switzerland" "Iceland" "Denmark" "Norway" ...
##  $ Happiness_Rank   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Happiness_Score  : num  7.59 7.56 7.53 7.52 7.43 ...
##  $ Economy          : num  1.4 1.3 1.33 1.46 1.33 ...
##  $ Family           : num  1.35 1.4 1.36 1.33 1.32 ...
##  $ Health           : num  0.941 0.948 0.875 0.885 0.906 ...
##  $ Freedom          : num  0.666 0.629 0.649 0.67 0.633 ...
##  $ Trust            : num  0.42 0.141 0.484 0.365 0.33 ...
##  $ Generosity       : num  0.297 0.436 0.341 0.347 0.458 ...
##  $ Dystopia_Residual: num  2.52 2.7 2.49 2.47 2.45 ...
##  $ year             : num  2015 2015 2015 2015 2015 ...
summary(wh15_17)
##    Country          Happiness_Rank   Happiness_Score    Economy      
##  Length:470         Min.   :  1.00   Min.   :2.693   Min.   :0.0000  
##  Class :character   1st Qu.: 40.00   1st Qu.:4.509   1st Qu.:0.6053  
##  Mode  :character   Median : 79.00   Median :5.282   Median :0.9954  
##                     Mean   : 78.83   Mean   :5.371   Mean   :0.9278  
##                     3rd Qu.:118.00   3rd Qu.:6.234   3rd Qu.:1.2524  
##                     Max.   :158.00   Max.   :7.587   Max.   :1.8708  
##      Family           Health          Freedom           Trust        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.7930   1st Qu.:0.4023   1st Qu.:0.2976   1st Qu.:0.05978  
##  Median :1.0257   Median :0.6301   Median :0.4183   Median :0.09950  
##  Mean   :0.9903   Mean   :0.5800   Mean   :0.4028   Mean   :0.13479  
##  3rd Qu.:1.2287   3rd Qu.:0.7683   3rd Qu.:0.5169   3rd Qu.:0.17316  
##  Max.   :1.6106   Max.   :1.0252   Max.   :0.6697   Max.   :0.55191  
##    Generosity     Dystopia_Residual      year     
##  Min.   :0.0000   Min.   :0.3286    Min.   :2015  
##  1st Qu.:0.1528   1st Qu.:1.7380    1st Qu.:2015  
##  Median :0.2231   Median :2.0946    Median :2016  
##  Mean   :0.2422   Mean   :2.0927    Mean   :2016  
##  3rd Qu.:0.3158   3rd Qu.:2.4556    3rd Qu.:2017  
##  Max.   :0.8381   Max.   :3.8377    Max.   :2017

2.4 Sorting Data

Here I delete countries that were not listed in all three years. Note: some countries underwent a name change, such as Hong Kong S.A.R., China becoming Hong Kong in 2017 for political reasons, while other countries, as acknowledged by the UNSDSN, did not complete the survey or did not administer it to their citizens. To minimize missing data, the committee used the 2014 data for these countries. This of course affects the accuracy of the data, as stated in the notes on source validity.

# a country present in all three years has sum(year) = 2015 + 2016 + 2017 = 6048
cwithout3years <- wh15_17 %>% group_by(Country) %>% mutate(count = sum(year))
cwithout3years %>% filter(count != 6048) %>% select(Country, Happiness_Rank, year) %>% arrange(Country)
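The filter works because 2015 + 2016 + 2017 = 6048, so any country whose year values sum to anything else is missing at least one year (counting distinct years would be an equivalent, arguably more robust, check). A base-R sketch with hypothetical countries:

```r
# Toy records: country "A" appears in all three years, "B" in only two
toy <- data.frame(Country = c("A", "A", "A", "B", "B"),
                  year    = c(2015, 2016, 2017, 2015, 2017))

# Sum the years per country; complete countries sum to 6048
year_sum <- tapply(toy$year, toy$Country, sum)
incomplete <- names(year_sum)[year_sum != 6048]
incomplete  # "B"
```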

3 Visualization

3.1 Correlation plot

corrplot(cor(wh15_17 %>% 
               select(Happiness_Score:Dystopia_Residual)),  method="color",
         sig.level = 0.01, insig = "blank",addCoef.col = "black", tl.srt=45, 
         type="upper")

According to the correlation plot above, there is a positive correlation between Happiness Score and all the other numerical variables; in other words, countries with higher happiness scores tend to have higher values of the seven contributing factors. Economy and Health show the strongest correlations with happiness, while Generosity has the weakest.

3.2 Scatter Plot

Since Economy and Health play the most significant roles in contributing to happiness, I draw a scatter plot of Happiness Score against Economy, coloured by Health.

plot_ly(data = wh15_17, x=~Economy, y=~Happiness_Score, color=~Health, 
        type = "scatter",text = ~paste("Country:", Country)) %>% 
        layout(title = "Relationship between Happiness, GDP and Health ", 
               xaxis = list(title = "GDP "),
               yaxis = list(title = "Happiness Score"))

This interactive scatter plot shows a strong positive correlation between GDP and happiness score. The points, coloured by the Health score, suggest that Health also has a large impact on happiness.

3.3 Histogram of the World Happiness Scores

hist(wh15_17$Happiness_Score , xlab = "World Happiness Score", main = "World Happiness Score during three years")

It is easy to see that the World Happiness Score is approximately normally distributed, with scores between 5 and 6 occurring most frequently.

3.4 Mapping World Happiness

world <- map_data('world')
world <- world %>% filter(region != "Antarctica")
world <- fortify(world)

#2015 World Happiness Score Map
happiness.score15 <- wh15_17 %>% select(Country, Happiness_Score, year) %>% filter(year == 2015)
happiness.score15 <- wildcard(df = happiness.score15, wildcard = "United States", values = "USA", expand = TRUE, rules = NULL)
happiness.score15 <- wildcard(df = happiness.score15, wildcard = "United Kingdom", values = "UK", expand = TRUE, rules = NULL)
happiness.score15 <- wildcard(df = happiness.score15, wildcard = "Democratic Republic of the Congo", values = "Congo (Kinshasa)",expand = TRUE, rules = NULL)

ggplot() + 
  geom_map(data=world, map=world,
                    aes(x=long, y=lat, group=group, map_id=region),
                    fill="white", colour="black") +
  geom_map(data=happiness.score15, map=world,
           aes(fill=Happiness_Score, map_id=Country),colour="black") +
  scale_fill_continuous(low="red", high="pink",guide="colorbar") + 
  labs(title = "2015 World Happiness Score Map")

#2016 World Happiness Score Map
happiness.score16 <- wh15_17 %>% select(Country, Happiness_Score, year) %>% filter(year == 2016)
happiness.score16 <- wildcard(df = happiness.score16, wildcard = "United States", values = "USA", expand = TRUE, rules = NULL)
happiness.score16 <- wildcard(df = happiness.score16, wildcard = "United Kingdom", values = "UK", expand = TRUE, rules = NULL)
happiness.score16 <- wildcard(df = happiness.score16, wildcard = "Democratic Republic of the Congo", values = "Congo (Kinshasa)",expand = TRUE, rules = NULL)

ggplot() + 
  geom_map(data=world, map=world,
                    aes(x=long, y=lat, group=group, map_id=region),
                    fill="white", colour="black") +
  geom_map(data=happiness.score16, map=world,
           aes(fill=Happiness_Score, map_id=Country),colour="black") +
  scale_fill_continuous(low="red", high="pink",guide="colorbar") + 
  labs(title = "2016 World Happiness Score Map")

#2017 World Happiness Score Map
happiness.score17 <- wh15_17 %>% select(Country, Happiness_Score, year) %>% filter(year == 2017)
happiness.score17 <- wildcard(df = happiness.score17, wildcard = "United States", values = "USA", expand = TRUE, rules = NULL)
happiness.score17 <- wildcard(df = happiness.score17, wildcard = "United Kingdom", values = "UK", expand = TRUE, rules = NULL)
happiness.score17 <- wildcard(df = happiness.score17, wildcard = "Democratic Republic of the Congo", values = "Congo (Kinshasa)",expand = TRUE, rules = NULL)

ggplot() + 
  geom_map(data=world, map=world,
                    aes(x=long, y=lat, group=group, map_id=region),
                    fill="white", colour="black") +
  geom_map(data=happiness.score17, map=world,
           aes(fill=Happiness_Score, map_id=Country),colour="black") +
  scale_fill_continuous(low="red", high="pink",guide="colorbar") + 
  labs(title = "2017 World Happiness Score Map")

Comparing the three maps above, most of North America, South America, and Australia show higher scores, while most African countries display lower scores. Comparing across the three years, the scores of several African and South American countries decrease as the years go on. This may be caused by war, economic conditions, or other factors.

3.5 Compare Top5, Middle5 and Worst5 Countries

world.happiness17 <- wh15_17 %>% filter(year == 2017)
top5 <- world.happiness17 %>% head(5) %>% mutate(Level = "TOP5")
middle5 <- world.happiness17[76:80, ] %>% mutate(Level = "MIDDLE5")
worst5 <- world.happiness17 %>% tail(5) %>% mutate(Level = "WORST5")
comparison <- bind_rows(top5, middle5, worst5)
comparison$Level <- as.factor(comparison$Level)
comparison <- transform(comparison, Level = factor(Level, levels = c("TOP5", "MIDDLE5", "WORST5" )))

datatable(comparison,
          options = list(
            lengthMenu = c(5, 10, 15)),
          caption = htmltools::tags$caption(style = 'caption-side: bottom; text-align: center;', 
              htmltools::em('Data table that only includes top5, middle5 and worst5 countries'))
          )

From the table, the happiest countries are located in Europe, particularly Scandinavia and Switzerland, while the least happy countries are located in Africa and the Middle East. This suggests that countries in close proximity, or within the same region, often share similar living conditions and are therefore affected by the various factors in similar ways.

We can dig deeper into this dataset to obtain a more accurate picture. In the section below, I therefore regress the happiness score on the other seven factors.

4 Regression

4.1 Scatter plot

# one scatter + linear-fit panel per factor
plot_factor <- function(var) {
  ggplot(data = wh15_17, aes(x = .data[[var]], y = Happiness_Score)) +
    geom_point(size = 0.5, alpha = 0.8, color = "green") +
    geom_smooth(method = "lm", fullrange = TRUE)
}
factor_names <- c("Economy", "Family", "Health", "Freedom",
                  "Trust", "Generosity", "Dystopia_Residual")
plots <- lapply(factor_names, plot_factor)

ggarrange(plotlist = plots, ncol = 3, nrow = 3)

Before fitting the regression, I first look at scatter plots of the happiness score (dependent variable) against the seven factors (independent variables). The graphs above show that while most factors correlate with the happiness score in some way, some correlate much more strongly than others. Economy and Health show strong correlations with the smallest variance, whilst Freedom and Family show strong correlations but with very high variance. Interestingly, Generosity shows no clear trend: whether happiness is low or high, there is nonetheless a large variance in generosity.

4.2 Splitting Dataset into training and test set

I split the dataset into a training set and a test set using a ratio of 0.8.

set.seed(666)
# keep Happiness_Score (response) and the seven factors
wh_reg <- wh15_17[3:10]
split = sample.split(wh15_17$Happiness_Score, SplitRatio = 0.8)
training_set = subset(wh_reg, split == TRUE)
test_set = subset(wh_reg, split == FALSE)

4.3 Variable Selection

4.3.1 Forward Stepwise Selection

reg_fwd <- regsubsets(Happiness_Score ~ .,data = training_set,nvmax=13,method="forward")
summary(reg_fwd)
## Subset selection object
## Call: regsubsets.formula(Happiness_Score ~ ., data = training_set, 
##     nvmax = 13, method = "forward")
## 7 Variables  (and intercept)
##                   Forced in Forced out
## Economy               FALSE      FALSE
## Family                FALSE      FALSE
## Health                FALSE      FALSE
## Freedom               FALSE      FALSE
## Trust                 FALSE      FALSE
## Generosity            FALSE      FALSE
## Dystopia_Residual     FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
##          Economy Family Health Freedom Trust Generosity Dystopia_Residual
## 1  ( 1 ) "*"     " "    " "    " "     " "   " "        " "              
## 2  ( 1 ) "*"     " "    " "    " "     " "   " "        "*"              
## 3  ( 1 ) "*"     " "    " "    "*"     " "   " "        "*"              
## 4  ( 1 ) "*"     "*"    " "    "*"     " "   " "        "*"              
## 5  ( 1 ) "*"     "*"    "*"    "*"     " "   " "        "*"              
## 6  ( 1 ) "*"     "*"    "*"    "*"     " "   "*"        "*"              
## 7  ( 1 ) "*"     "*"    "*"    "*"     "*"   "*"        "*"
fwd_summary <- summary(reg_fwd)
plot(fwd_summary$bic)

The summary lists the best model of each size. For instance, the best model of size 1 includes only Economy, and the best model of size 2 includes Economy and Dystopia_Residual. In the plot, the lowest BIC is achieved by the 7th model, which includes all of the variables. I will also check the models using cross-validation.
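To make the `which` table easier to read, here is the same workflow on simulated data (hypothetical variables x1–x3, not the happiness data), including how to extract the best model of a given size:

```r
library(leaps)

set.seed(1)
# simulated data: y depends strongly on x1, weakly on x2, not at all on x3
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

fit <- regsubsets(y ~ ., data = toy, method = "forward")
s <- summary(fit)
s$which       # TRUE marks the variables in the best model of each size
coef(fit, 1)  # coefficients of the best one-variable model (x1 here)
```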

CVmse <- rep(0,7)
for(i in 1:7){
  # predictor positions in fwd_summary$which are offset by one relative to
  # training_set, whose first column is the response Happiness_Score
  tempCols <- which(fwd_summary$which[i,-1]) + 1
  tempCols <- c(1, tempCols)
  tempGLM <- glm(Happiness_Score~.,data=training_set[,tempCols])
  tempCV <- cv.glm(data=training_set[,tempCols],glmfit=tempGLM,K = 10)
  CVmse[i] <- tempCV$delta[1]
}
plot(CVmse)

The model with the lowest cross-validation error is again the 7th model; in my run the minimum MSE was 0.32851.

4.3.2 Backward Stepwise Selection

reg_bwd <- regsubsets(Happiness_Score ~ .,data = training_set,nvmax=13,method="backward")
summary(reg_bwd)
## Subset selection object
## Call: regsubsets.formula(Happiness_Score ~ ., data = training_set, 
##     nvmax = 13, method = "backward")
## 7 Variables  (and intercept)
##                   Forced in Forced out
## Economy               FALSE      FALSE
## Family                FALSE      FALSE
## Health                FALSE      FALSE
## Freedom               FALSE      FALSE
## Trust                 FALSE      FALSE
## Generosity            FALSE      FALSE
## Dystopia_Residual     FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: backward
##          Economy Family Health Freedom Trust Generosity Dystopia_Residual
## 1  ( 1 ) "*"     " "    " "    " "     " "   " "        " "              
## 2  ( 1 ) "*"     " "    " "    " "     " "   " "        "*"              
## 3  ( 1 ) "*"     " "    " "    "*"     " "   " "        "*"              
## 4  ( 1 ) "*"     "*"    " "    "*"     " "   " "        "*"              
## 5  ( 1 ) "*"     "*"    "*"    "*"     " "   " "        "*"              
## 6  ( 1 ) "*"     "*"    "*"    "*"     " "   "*"        "*"              
## 7  ( 1 ) "*"     "*"    "*"    "*"     "*"   "*"        "*"
bwd_summary <- summary(reg_bwd)
plot(bwd_summary$bic)

Backward stepwise selection recovers the same sequence of models as forward stepwise selection, which agrees with the results above.

4.3.3 LASSO

lasso.cv <- cv.glmnet(x=as.matrix(training_set[,-1]),y=as.matrix(training_set[,1]),alpha=1,nfolds = 10)
plot(lasso.cv)

The lowest MSE is achieved with small values of the penalty parameter lambda: the minimum occurs around log(lambda) = -4, where the MSE is roughly 0.3. The corresponding model keeps all 7 coefficients, which is the same full model obtained above.
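For reference, the same `cv.glmnet` workflow on simulated data (five hypothetical predictors, only two informative) shows how the minimizing penalty and its coefficients are extracted:

```r
library(glmnet)

set.seed(2)
# simulated design: only the first two of five predictors matter
n <- 200; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + 0.5 * x[, 2] + rnorm(n, sd = 0.1)

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # alpha = 1 is the LASSO
log(cv_fit$lambda.min)          # log-penalty achieving the minimum CV error
coef(cv_fit, s = "lambda.min")  # sparse coefficients at that penalty
```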

4.4 Nonlinear effects

At this point most of you should be convinced that the best model uses all the variables. Next, we should check whether any nonlinear transformations are necessary.

4.4.1 Multiple Linear Regression

First, I fit a multiple linear regression to the training set.

reg_lm = lm(Happiness_Score ~ .,data = training_set)
summary(reg_lm)
## 
## Call:
## lm(formula = Happiness_Score ~ ., data = training_set)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.659e-04 -2.414e-04 -2.620e-06  2.574e-04  5.007e-04 
## 
## Coefficients:
##                    Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)       7.876e-05  8.439e-05     0.933    0.351    
## Economy           1.000e+00  6.606e-05 15139.332   <2e-16 ***
## Family            1.000e+00  6.362e-05 15718.154   <2e-16 ***
## Health            9.999e-01  1.020e-04  9803.124   <2e-16 ***
## Freedom           1.000e+00  1.323e-04  7555.868   <2e-16 ***
## Trust             9.999e-01  1.616e-04  6186.889   <2e-16 ***
## Generosity        1.000e+00  1.258e-04  7951.857   <2e-16 ***
## Dystopia_Residual 1.000e+00  2.676e-05 37368.142   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0002895 on 368 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.328e+08 on 7 and 368 DF,  p-value: < 2.2e-16

The summary shows that all independent variables have a significant impact and that the adjusted R-squared is 1. As discussed above, this is because the happiness score is, by construction, the sum of the seven factors, so the relationship between the dependent and independent variables is exactly linear; that is why the adjusted R-squared equals 1. As a result, I expect multiple linear regression to predict happiness scores essentially perfectly. Now, check the prediction result on the test set.
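Before moving to prediction, that reasoning can be sanity-checked on a toy dataset (simulated, not the happiness data): when the response is an exact, noise-free sum of the predictors, lm recovers unit coefficients and an R-squared of 1.

```r
set.seed(5)
# hypothetical components a, b, c; the response is their exact sum
toy <- data.frame(a = runif(20), b = runif(20), c = runif(20))
toy$score <- toy$a + toy$b + toy$c

fit <- lm(score ~ ., data = toy)
summary(fit)$r.squared  # 1, up to numerical precision
coef(fit)[-1]           # each slope is (numerically) 1
```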

pred_lm = predict(reg_lm, newdata = test_set)

Actual_lm <- as.data.frame(cbind(Prediction = pred_lm, Actual = test_set$Happiness_Score))

gg.lm <- ggplot(Actual_lm, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "Multiple Linear Regression", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.lm

As expected, the plot shows that multiple linear regression works very well. Next, I plot the residuals against each covariate.

train.melt <- melt(training_set)
train_melt<- cbind(train.melt,resid=reg_lm$residuals)
ggplot(train_melt,aes(x=value,y=resid))+geom_point()+geom_smooth(method="loess")+facet_wrap(~variable,scales="free")

There are obvious non-linearities with respect to the Economy, Trust, and Dystopia_Residual variables. I will go through them in that order, trying polynomial terms for each variable and choosing the polynomial degree by cross-validation.

4.4.2 Transforming

Transforming Economy

EcoMSE <- rep(0,10)
for(i in 1:10){
  templm <- glm(Happiness_Score~.-Economy+poly(Economy,i),data=training_set)
  tempCV <- cv.glm(training_set,templm,K = 10)
  EcoMSE[i] <- tempCV$delta[1]
}
plot(EcoMSE)

The minimum is obtained with a third-degree polynomial for Economy.

Transforming Trust

TruMSE <- rep(0,10)
for(i in 1:10){
  templm2 <- glm(Happiness_Score~.-Economy-Trust+poly(Economy,3)+poly(Trust,i),
                 data=training_set)
  tempCV2 <- cv.glm(training_set,templm2,K = 10)
  TruMSE[i] <- tempCV2$delta[1]
}
plot(TruMSE)

The minimum is obtained with a third-degree polynomial for Trust.

Transforming Dystopia_Residual

DysMSE <- rep(0,10)
for(i in 1:10){
  templm3 <- glm(Happiness_Score~.-Economy-Trust-Dystopia_Residual
                 +poly(Economy,3)+poly(Trust,3)+poly(Dystopia_Residual,i),data=training_set)
  tempCV3 <- cv.glm(training_set,templm3,K = 10)
  DysMSE[i] <- tempCV3$delta[1]
}
plot(DysMSE)

The minimum is obtained with an eighth-degree polynomial for Dystopia_Residual.

4.4.3 Polynomial Regression

After transforming the non-linear independent variables, I obtain the final model.

reg_poly <- lm(Happiness_Score~
                 poly(Economy,3)+Family+Health+Freedom+poly(Trust,3)+Generosity+poly(Dystopia_Residual,8), data = training_set)
summary(reg_poly)
## 
## Call:
## lm(formula = Happiness_Score ~ poly(Economy, 3) + Family + Health + 
##     Freedom + poly(Trust, 3) + Generosity + poly(Dystopia_Residual, 
##     8), data = training_set)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.655e-04 -2.315e-04  5.620e-06  2.329e-04  5.611e-04 
## 
## Coefficients:
##                               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)                  3.172e+00  8.693e-05 36483.276   <2e-16 ***
## poly(Economy, 3)1            8.053e+00  5.514e-04 14605.597   <2e-16 ***
## poly(Economy, 3)2            4.643e-04  3.542e-04     1.311    0.191    
## poly(Economy, 3)3            4.474e-04  3.206e-04     1.396    0.164    
## Family                       1.000e+00  6.408e-05 15605.489   <2e-16 ***
## Health                       1.000e+00  1.068e-04  9361.388   <2e-16 ***
## Freedom                      1.000e+00  1.338e-04  7471.399   <2e-16 ***
## poly(Trust, 3)1              2.141e+00  4.129e-04  5184.979   <2e-16 ***
## poly(Trust, 3)2              3.649e-04  3.042e-04     1.199    0.231    
## poly(Trust, 3)3             -2.479e-04  3.275e-04    -0.757    0.450    
## Generosity                   1.000e+00  1.310e-04  7631.350   <2e-16 ***
## poly(Dystopia_Residual, 8)1  1.117e+01  3.049e-04 36645.386   <2e-16 ***
## poly(Dystopia_Residual, 8)2 -7.707e-05  2.986e-04    -0.258    0.796    
## poly(Dystopia_Residual, 8)3  3.254e-04  2.939e-04     1.107    0.269    
## poly(Dystopia_Residual, 8)4 -1.994e-04  2.954e-04    -0.675    0.500    
## poly(Dystopia_Residual, 8)5 -1.707e-04  2.981e-04    -0.573    0.567    
## poly(Dystopia_Residual, 8)6 -5.434e-04  2.914e-04    -1.865    0.063 .  
## poly(Dystopia_Residual, 8)7  3.854e-04  2.928e-04     1.316    0.189    
## poly(Dystopia_Residual, 8)8 -2.544e-04  2.920e-04    -0.871    0.384    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0002881 on 357 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.27e+08 on 18 and 357 DF,  p-value: < 2.2e-16
polyGLM <- glm(Happiness_Score~
                 poly(Economy,3)+Family+Health+Freedom+poly(Trust,3)+Generosity+poly(Dystopia_Residual,8),data = training_set)
cv.glm(training_set,polyGLM,K=10)$delta[1]
## [1] 8.858246e-08

The cross-validated MSE of the polynomial regression (8.858246e-08 in the output above) is dramatically smaller than the 0.32851 obtained during variable selection.

4.5 Checking interaction effects

reg_lminter = lm(Happiness_Score ~ .^2,data = wh_reg)
summary(reg_lminter)
## 
## Call:
## lm(formula = Happiness_Score ~ .^2, data = wh_reg)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.821e-04 -2.020e-04  3.970e-06  2.358e-04  5.992e-04 
## 
## Coefficients:
##                                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                   4.562e-04  2.926e-04    1.559  0.11979    
## Economy                       9.993e-01  3.500e-04 2855.111  < 2e-16 ***
## Family                        1.000e+00  3.130e-04 3195.636  < 2e-16 ***
## Health                        1.000e+00  5.787e-04 1728.575  < 2e-16 ***
## Freedom                       9.989e-01  7.218e-04 1383.864  < 2e-16 ***
## Trust                         1.001e+00  8.479e-04 1180.386  < 2e-16 ***
## Generosity                    1.000e+00  5.848e-04 1710.399  < 2e-16 ***
## Dystopia_Residual             9.999e-01  1.099e-04 9100.127  < 2e-16 ***
## Economy:Family               -7.129e-05  2.125e-04   -0.335  0.73742    
## Economy:Health               -2.791e-04  2.180e-04   -1.280  0.20108    
## Economy:Freedom               1.744e-03  5.481e-04    3.182  0.00156 ** 
## Economy:Trust                -1.804e-04  5.796e-04   -0.311  0.75578    
## Economy:Generosity            7.849e-04  5.310e-04    1.478  0.14010    
## Economy:Dystopia_Residual     5.791e-05  1.111e-04    0.521  0.60237    
## Family:Health                 5.473e-04  3.972e-04    1.378  0.16901    
## Family:Freedom               -2.677e-04  4.219e-04   -0.635  0.52605    
## Family:Trust                 -2.666e-04  6.159e-04   -0.433  0.66541    
## Family:Generosity            -4.028e-04  5.281e-04   -0.763  0.44606    
## Family:Dystopia_Residual     -5.424e-05  1.104e-04   -0.491  0.62351    
## Health:Freedom               -1.601e-03  7.389e-04   -2.166  0.03082 *  
## Health:Trust                  1.140e-03  1.111e-03    1.026  0.30548    
## Health:Generosity            -1.535e-03  1.002e-03   -1.532  0.12618    
## Health:Dystopia_Residual      3.532e-05  1.774e-04    0.199  0.84226    
## Freedom:Trust                -3.664e-04  1.294e-03   -0.283  0.77717    
## Freedom:Generosity            1.165e-03  1.107e-03    1.053  0.29290    
## Freedom:Dystopia_Residual     2.061e-04  2.107e-04    0.978  0.32863    
## Trust:Generosity             -8.158e-04  1.249e-03   -0.653  0.51394    
## Trust:Dystopia_Residual      -5.501e-04  2.432e-04   -2.261  0.02422 *  
## Generosity:Dystopia_Residual  7.007e-05  2.052e-04    0.341  0.73290    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0002836 on 441 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.692e+08 on 28 and 441 DF,  p-value: < 2.2e-16

From the summary above, nearly all interaction terms are insignificant at the 0.05 level, and the few that are nominally significant (Economy:Freedom, Health:Freedom, Trust:Dystopia_Residual) have coefficients on the order of 1e-3, which are negligible. I therefore conclude that interaction effects can be ignored. Having selected the final model (the polynomial regression), several diagnostics should be checked.
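For clarity, the `.^2` shorthand used above expands a formula into all main effects plus every pairwise interaction; a minimal sketch with three hypothetical predictors:

```r
# three hypothetical predictors
toy <- data.frame(x1 = 1:4, x2 = c(2, 4, 6, 8), x3 = c(1, 0, 1, 0))

# ~ .^2 expands to: intercept, x1, x2, x3, x1:x2, x1:x3, x2:x3
mm <- model.matrix(~ .^2, data = toy)
colnames(mm)
ncol(mm)  # 7
```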

4.6 Checking model diagnostics

4.6.1 Checking Gaussian distribution of residuals.

Checking whether the residuals follow a Gaussian distribution.

wh15new=filter(wh15_17,year==2015)
reg_15 <- lm(Happiness_Score ~poly(Economy,3)+Family+Health+Freedom+poly(Trust,3)+Generosity+poly(Dystopia_Residual,8),
               data = wh15new)

shapiro.test(reg_15$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  reg_15$residuals
## W = 0.9835, p-value = 0.05664

According to the Shapiro-Wilk test, the p-value (0.05664) is higher than the 0.05 significance level, so we cannot reject the null hypothesis that the residuals are normally distributed; the Gaussian assumption on the residuals therefore appears to hold.
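As a reference for how the test behaves, a quick sketch applying `shapiro.test` to a clearly normal and a clearly skewed simulated sample:

```r
set.seed(3)
normal_sample <- rnorm(150)  # drawn from a normal distribution
skewed_sample <- rexp(150)   # drawn from a skewed (exponential) distribution

shapiro.test(normal_sample)$p.value  # typically well above 0.05: do not reject
shapiro.test(skewed_sample)$p.value  # far below 0.05: reject normality
```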

4.6.2 Checking homoskedasticity

Checking whether residuals are homoskedastic, in other words, checking if residuals have the same variance.

plot(reg_poly,which=1)

ncvTest(reg_poly)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.04607611, Df = 1, p = 0.83004

The p-value of the ncvTest is higher than the 0.05 significance level, so we cannot reject the null hypothesis that the variance of the residuals is constant, and we infer that the homoskedasticity assumption is satisfied.
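The idea behind ncvTest can be illustrated directly: regress the squared residuals on the fitted values; a slope near zero is consistent with constant variance. A base-R sketch of this auxiliary regression on the built-in mtcars data:

```r
# Miniature non-constant-variance check: do the squared residuals
# grow with the fitted values?
fit <- lm(mpg ~ wt, data = mtcars)
aux <- lm(residuals(fit)^2 ~ fitted(fit))
summary(aux)$coefficients  # slope row: estimate and its p-value
```

This is only a sketch of the intuition; car::ncvTest, as used above, performs the corresponding score test properly.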

4.6.3 Confidence Interval Comparison

Confidence interval from normal approximation

poly_pred <- predict(reg_poly,training_set,interval = "confidence")
z.test(poly_pred,sigma.x = 6.8)
## 
##  One-sample z-Test
## 
## data:  poly_pred
## z = 26.557, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  4.980159 5.773815
## sample estimates:
## mean of x 
##  5.376987

Confidence interval from Bootstrap

sim<-data.frame(MeanRep=replicate(1e5,mean(sample(training_set$Happiness_Score,replace = T))))
point<-round(mean(sim$MeanRep),digits = 2)
conf<-data.frame(Val=round(quantile(sim$MeanRep,c(.025,1-.025)),digits = 2))


ggplot(sim,aes(x=MeanRep))+
  geom_density(fill="steelblue3",alpha=.33,size=1.4)+
  scale_x_continuous(breaks = seq(0,40,2))+
  geom_vline(size=1.4,linetype=2,aes(xintercept=mean(MeanRep)))+
  geom_vline(size=1.4,linetype=2,color="red",aes(xintercept=quantile(sim$MeanRep,.025)))+
  geom_vline(size=1.4,linetype=2,color="red",aes(xintercept=quantile(sim$MeanRep,1-.025)))+
  theme_bw()+theme(axis.text = element_text(color="black"))+
  labs(x="Mean Happiness Score",y="Density",subtitle=paste(" Point Estimate:",point,"\n","95% Conf. Interval:",conf[1,],"-",conf[2,]))

Under 95% confidence, the normal approximation gives a point estimate of 5.38 for the mean happiness score, with a confidence interval from 4.98 to 5.77, while the bootstrap confidence interval falls between 5.26 and 5.49 around its point estimate. Thus the normal approximation arguably did a better job here than the bootstrap.
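The same percentile interval can also be produced with the boot package (loaded at the top of the report). A self-contained sketch, with simulated scores standing in for training_set$Happiness_Score:

```r
library(boot)  # shipped with base R

set.seed(1)
scores <- rnorm(376, mean = 5.38, sd = 1.1)  # stand-in happiness scores

# Resample the mean R times, then take the percentile interval.
b <- boot(scores, statistic = function(d, i) mean(d[i]), R = 2000)
boot.ci(b, type = "perc")  # analogous to the quantile() approach above
```

The manual replicate()/quantile() version above and boot.ci's "perc" interval implement the same idea; boot additionally offers bias-corrected variants via other type values.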

4.7 Splines

In this part, I’ll use splines for the polynomial terms fitting an additive model.

gam.happy <- gam(Happiness_Score~
                    s(Economy)+Family+Health+Freedom+s(Trust)+Generosity+s(Dystopia_Residual),
               data = training_set)
summary(gam.happy)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Happiness_Score ~ s(Economy) + Family + Health + Freedom + s(Trust) + 
##     Generosity + s(Dystopia_Residual)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.172e+00  8.486e-05   37373   <2e-16 ***
## Family      1.000e+00  6.348e-05   15752   <2e-16 ***
## Health      9.999e-01  1.023e-04    9777   <2e-16 ***
## Freedom     1.000e+00  1.320e-04    7574   <2e-16 ***
## Generosity  1.000e+00  1.263e-04    7920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df         F p-value    
## s(Economy)           1.505  1.871 1.224e+08  <2e-16 ***
## s(Trust)             1.493  1.836 1.881e+07  <2e-16 ***
## s(Dystopia_Residual) 1.000  1.000 1.390e+09  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =      1   Deviance explained =  100%
## GCV = 8.5368e-08  Scale est. = 8.3326e-08  n = 376
CVgam(formula(gam.happy),training_set,nfold=10)
##    GAMscale CV-mse-GAM  
##           0           0

The cross-validated MSE of this model is essentially 0, matching the near-perfect fit obtained from the polynomial regression. Next, I test the quality of the additive model on the test set.
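A quick way to see how much nonlinearity the smooths actually used is to plot them and inspect their effective degrees of freedom (edf close to 1 means essentially linear, as with s(Dystopia_Residual) above). A self-contained sketch on simulated data; the report's fitted model is gam.happy:

```r
library(mgcv)  # shipped with base R

set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.2)

g <- gam(y ~ s(x), data = d)
plot(g, shade = TRUE)  # fitted smooth with confidence band
summary(g)$edf         # edf well above 1 => genuinely nonlinear
</code>
```

Running plot(gam.happy, pages = 1) on the report's model produces the corresponding panels for s(Economy), s(Trust), and s(Dystopia_Residual).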

pred_add = predict(gam.happy, newdata = test_set)

Actual_add <- as.data.frame(cbind(Prediction = pred_add, Actual = test_set$Happiness_Score))

gg.add <- ggplot(Actual_add, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "Additive Model", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.add

As expected, the plot shows that the additive model fits the test data almost perfectly.

4.8 Support Vector Regression

Fitting SVR to the training set and checking the quality of the regression on the test set.

reg_svr = svm(formula = Happiness_Score ~ .,data = training_set,
                    type = 'eps-regression',kernel = 'radial')

pred_svr = predict(reg_svr, newdata = test_set)

Actual_svr <- as.data.frame(cbind(Prediction = pred_svr, Actual = test_set$Happiness_Score))

gg.svr <- ggplot(Actual_svr, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "SVR", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.svr

The plot indicates that SVR performs well on this dataset.

4.9 Random Forest Regression

Fitting a Random Forest regression to the training set and checking the quality of the regression on the test set.

set.seed(1234)
reg_rf = randomForest(x = training_set[-1],
                         y = training_set$Happiness_Score,
                         ntree = 500)
pred_rf = predict(reg_rf, newdata = test_set)

Actual_rf <- as.data.frame(cbind(Prediction = pred_rf, Actual = test_set$Happiness_Score))

gg.rf <- ggplot(Actual_rf, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "Random Forest Regression", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.rf

It can be seen that Random Forest regression is not as accurate as SVR or multiple linear regression at predicting happiness scores.

4.10 Neural Net Model

Fitting a neural net to the training set and checking the quality of the regression on the test set.

nn <- neuralnet(Happiness_Score ~Economy + Family + Health  + Freedom + Generosity + Trust + Dystopia_Residual, data=training_set,hidden=10,linear.output=TRUE)

pred_nn <- compute(nn,test_set[,2:8])

Pred_Actual_nn <- as.data.frame(cbind(Prediction = pred_nn$net.result, Actual = test_set$Happiness_Score))

gg.nn <- ggplot(Pred_Actual_nn, aes(Actual, V1 )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "Neural Net", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
      axis.title = element_text(family = "Helvetica", size = (10)))
gg.nn

The graph shows that the neural net is among the best predictors, second only to multiple linear regression.
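One caveat with neuralnet: it converges far more reliably when the inputs are scaled. A common recipe is min-max scaling of each predictor to [0, 1] before training, inverting the transform after prediction. A base-R sketch with hypothetical columns:

```r
# Min-max scale each column to [0, 1]; invert after prediction with
# x * (max - min) + min using the original column ranges.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

set.seed(1)
d <- data.frame(Economy = runif(20, 0, 2),  # hypothetical stand-ins
                Health  = runif(20, 0, 1))
d_scaled <- as.data.frame(lapply(d, minmax))
sapply(d_scaled, range)  # every column now spans exactly [0, 1]
```

Here the happiness scores and their contributing factors already live on similar small scales, which is likely why the unscaled fit above still converges.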

4.11 Polynomial Regression

pred_poly = predict(reg_poly, newdata = test_set)

Actual_poly <- as.data.frame(cbind(Prediction = pred_poly, Actual = test_set$Happiness_Score))

gg.poly <- ggplot(Actual_poly, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "Polynomial regression", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.poly

4.12 Comparison of all models

ggarrange(gg.lm, gg.poly, gg.add,gg.svr,gg.rf, gg.nn, ncol = 2, nrow = 3)

MSE.lm <- sum((test_set$Happiness_Score - pred_lm)^2)/nrow(test_set)
MSE.poly <- sum((test_set$Happiness_Score - pred_poly)^2)/nrow(test_set)
MSE.add <- sum((test_set$Happiness_Score - pred_add)^2)/nrow(test_set)
MSE.svr <- sum((test_set$Happiness_Score - pred_svr)^2)/nrow(test_set)
MSE.rf <- sum((test_set$Happiness_Score - pred_rf)^2)/nrow(test_set)
MSE.nn <- sum((test_set$Happiness_Score - pred_nn$net.result)^2)/nrow(test_set)
print(paste("Mean Squared Error (Multiple Linear Regression):", MSE.lm))
## [1] "Mean Squared Error (Multiple Linear Regression): 0.0000000802962922642319"
print(paste("Mean Squared Error (polynomial Regression):", MSE.poly))
## [1] "Mean Squared Error (polynomial Regression): 0.0000000987572476581496"
print(paste("Mean Squared Error (SVR):", MSE.svr))
## [1] "Mean Squared Error (SVR): 0.0182972257535053"
print(paste("Mean Squared Error (Random Forest):", MSE.rf))
print(paste("Mean Squared Error (Neural Net):", MSE.nn))
## [1] "Mean Squared Error (Neural Net): 0.000274136341683124"

According to the comparison, Multiple Linear Regression, Polynomial Regression, the additive model, and the Neural Net performed best. The MSE values confirm this: these four models predict approximately equally well, and all of them do somewhat better than SVR.
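Gathering the printed MSEs into one sorted table makes the ranking explicit. The values below are transcribed from the output above (a sketch, not a re-computation):

```r
# MSEs as printed above, collected for a side-by-side ranking.
mse_tab <- data.frame(
  Model = c("Linear", "Polynomial", "Neural Net", "SVR"),
  MSE   = c(8.03e-08, 9.88e-08, 2.74e-04, 1.83e-02)
)
mse_tab[order(mse_tab$MSE), ]  # smallest (best) first
```

The same table could be rendered for the report with knitr::kable(), which is already loaded at the top.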

5 Huge Package

library(huge)
wh_matrix <- data.matrix(wh_reg)
out=huge(wh_matrix, nlambda = 40, lambda.min.ratio = 0.4, method = "glasso")
## Conducting the graphical lasso (glasso) with lossless screening....in progress
## Conducting the graphical lasso (glasso)....done.
out
## Model: graphical lasso (glasso)
## Input: The Data Matrix
## Path length: 40 
## Graph dimension: 8 
## Sparsity level: 0 -----> 0.4285714286
plot(out)

The result confirms the expectation from above that Economy is the variable most strongly related to the happiness score. As lambda decreases, Health also becomes significantly related to the happiness score.

6 Conclusion

After analysing the global happiness data from 2015 to 2017, compiled by the United Nations Sustainable Development Solutions Network, we were able to assess the impact of each factor in determining happiness.

According to the correlation and scatter plots, among the seven factors Economy tends to have the greatest effect on happiness, with Health close behind. Thus a country that wants a higher happiness score in the future should pay particular attention to its economy and the health of its citizens.

Besides, exploring the differences in happiness scores among regions, I find that people in Europe and North America are much happier than people in Africa. The happiest countries were located in Europe, particularly Scandinavia and Switzerland, while the least happy countries were located in Africa and the Middle East. This suggests that countries in close proximity, or in the same region, often share similar living conditions and are therefore affected by these factors in similar ways.

Furthermore, I dug deeper in search of a model that best fits the happiness score formula. Based on the results in this report, Multiple Linear Regression, Polynomial Regression, the additive model, or the Neural Net can all be used to predict future happiness scores, since they all fit the dataset excellently.

Based on these findings, governments can focus on improving and maintaining the factors most strongly related to the happiness score, and may draw from them useful ideas for achieving a better future. In this way, nations can move toward the genuine happiness that humans have always been striving for.